Add SQLite caching with --no-cache and --force-check flags (#2219) #2608

obiwan04kanobi · 2025-10-05T07:13:33Z

Description

Implements SQLite-based result caching as discussed in #2219 to improve performance and reduce rate limiting when performing multiple username lookups.

Changes

Core Caching Implementation

New module: sherlock_project/cache.py - SQLite cache manager with TTL support
Modified: sherlock_project/sherlock.py - Integrated cache into main sherlock function
Cache location: ~/.sherlock/cache.db (automatic directory creation)
Default TTL: 24 hours (86400 seconds, configurable)

CLI Arguments Added

--no-cache - Disable caching completely (don't read or write to cache)
--force-check - Ignore cached results and force fresh checks for all sites
--cache-duration SECONDS - Customize cache expiration time (default: 86400)

Cache Management Utility

New CLI tool: sherlock-cache with subcommands:
- stats - Show cache statistics (total/valid/expired entries, cache path)
- clear - Clear cache entries (all, by username, or by site)
- cleanup - Remove only expired entries

Testing & Documentation

New tests: tests/test_cache.py - Comprehensive cache functionality tests
Updated docs: docs/README.md - Added cache usage and management section
All tests passing: 5/5 cache tests pass, existing tests unaffected

Usage Examples

# Default behavior (uses cache)
sherlock username

# Ignore cache and force fresh checks
sherlock username --force-check

# Disable caching completely
sherlock username --no-cache

# Custom cache duration (1 hour)
sherlock username --cache-duration 3600

# Cache management
sherlock-cache stats              # Show statistics
sherlock-cache clear              # Clear all cache
sherlock-cache clear --username user123
sherlock-cache cleanup            # Remove expired only

Performance Impact

First run: Same speed (queries all sites, writes to cache)
Subsequent runs: Near-instant for cached results (~100x faster)
Storage: Minimal (~1-5KB per username depending on results)
No breaking changes: Cache is opt-out via --no-cache

Implementation Details

Cache Strategy

Stores CLAIMED and AVAILABLE status results
Does not cache UNKNOWN, ILLEGAL, or WAF statuses
Automatic cleanup of expired entries on each run
Cache key: (username, site) tuple (prevents conflicts)

Database Schema

CREATE TABLE results (
    username TEXT NOT NULL,
    site TEXT NOT NULL,
    status TEXT NOT NULL,
    url TEXT,
    timestamp INTEGER NOT NULL,
    PRIMARY KEY (username, site)
);
CREATE INDEX idx_timestamp ON results(timestamp);

Testing

# Run cache tests
poetry run pytest tests/test_cache.py -v

# Run all tests
poetry run pytest -v

Test Results: ✅ All 5 cache tests passing

test_cache_set_and_get - Basic cache operations
test_cache_expiration - TTL functionality
test_cache_clear_all - Clear entire cache
test_cache_clear_username - Clear by username
test_cache_stats - Statistics generation

Related Issue

Closes #2219

Checklist

Code of Conduct

I agree to follow this project's Code of Conduct

Implements SQLite-based result caching to improve performance and reduce rate limiting. Results are cached for 24 hours by default and stored in ~/.sherlock/cache.db. Features: - Automatic caching of username lookup results with configurable TTL - --no-cache flag to disable caching completely - --force-check flag to ignore cached results and force fresh lookups - --cache-duration flag to customize cache expiration (default: 86400s) - sherlock-cache CLI utility for cache management (stats, clear, cleanup) - Comprehensive test suite for cache functionality Technical Details: - Cache stored in SQLite database at ~/.sherlock/cache.db - Automatic cleanup of expired entries on each run - Caches both CLAIMED and AVAILABLE status results - Thread-safe database operations - Zero dependencies (uses built-in sqlite3 module) Resolves sherlock-project#2219

obiwan04kanobi · 2025-10-05T07:24:20Z

This PR is part of Hacktoberfest 2025 contribution efforts.

ppfeister

Additionally, have any checks been performed for the possibility of SQLi?

This was not a complete review, as that'll take some time -- just some comments at first glance.

[PEP 8] (second bullet point, primarily, with a blank line separator between sections and two lines after the final import)

tests/test_cache.py

pyproject.toml

sherlock_project/cache_cli.py

sherlock_project/sherlock.py

sherlock_project/cache_cli.py

sherlock_project/cache.py

@ppfeister

Addresses all feedback from PR sherlock-project#2608 review by @ppfeister Security Hardening: - Implement SQL injection protection via parameterized queries in all database operations (get, set, clear, cleanup_expired, get_stats) - Add comprehensive input validation (null bytes, control characters, length limits) to prevent injection attacks - Implement path traversal protection restricting cache to ~/.sherlock - Add URL validation (max 2048 chars, no null bytes) - Store cache_duration per entry to prevent TTL drift across runs Code Quality (PEP 8 Compliance): - Fix import ordering: stdlib → third-party → local with blank line separators in cache.py, cache_cli.py, and sherlock.py - Replace Any type hints with specific unions (str|int, QueryStatus) - Remove shebang and __main__ block from cache_cli.py to prevent unsupported direct script execution Testing Improvements: - Replace file-based tests with unittest.mock (no disk I/O) - Remove time.sleep() calls, use mocked timestamps instead - Add security-specific tests (SQL injection, path traversal, null bytes) - Verify parameterized query usage in all database operations - Follow maintainer's testing patterns from feat/better_waf branch - Fix unused variable linting warnings (F841) Database Migration: - Add automatic schema migration for existing cache databases - Detect and handle old schema missing cache_duration column - Gracefully drop and recreate incompatible cache tables Platform Compatibility: - Verify Windows compatibility (Path.home() behavior documented) - Test Docker container build and execution - Confirm cross-platform path separator handling Test Results: - Linting: ✓ All checks passed - Cache tests: ✓ 14/14 passed - Docker build: ✓ Verified with act - Integration tests: 38/39 passed (1 flaky external site WAF)

obiwan04kanobi · 2025-10-06T15:29:25Z

Additionally, have any checks been performed for the possibility of SQLi?

This was not a complete review, as that'll take some time -- just some comments at first glance.

[PEP 8] (second bullet point, primarily, with a blank line separator between sections and two lines after the final import)

✅ Fixed - All database operations now use parameterized queries with ? placeholders. Added comprehensive input validation (null bytes, control characters, length limits). Tests verify parameterized query usage.

ppfeister

Attached a couple additional thoughts

Appreciate your patience and cooperation on this one. As a larger architectural change, I'd rather make sure it's right the first time rather than have things unexpectedly break.

ppfeister · 2025-10-07T03:26:45Z

docs/README.md

                [--site SITE_NAME] [--proxy PROXY_URL] [--json JSON_FILE]
                [--timeout TIMEOUT] [--print-all] [--print-found] [--no-color]
-                [--browse] [--local] [--nsfw]
+                [--browse] [--local] [--nsfw] [--no-cache] [--force-check]


Bikeshed:

Could we do --skip-cache and --ignore-cache? I feel like that just removes some ambiguity.

Like, what does --force-check even do? Of course I want to check these usernames. Maybe it bypasses username validation? (of course we know what it does)

Open to hearing your thoughts.

ppfeister · 2025-10-07T03:40:17Z

sherlock_project/cache.py

+        if cache_path is None:
+            cache_dir = Path.home() / ".sherlock"
+            cache_path = str(cache_dir / "cache.db")


Two items here...

Firstly, while .db is technically accurate, and extensions aren't important programmatically, it's always good to have proper hinting for the human factor. "What type of db is this random file?" Using .sqlite3 rather than .db may be ideal.

And it adds some complexity (you'd have to check for OS type), but it occurs to me now that this is also non-standard.

Ideally, *nix type systems should see this cache at ~/.cache/sherlock/cache.sqlite3 (while not defined by the FHS, it is defined by XDG BDS and is otherwise convention). This is sometimes defined by $XDG_CACHE_HOME, but many systems lack this variable (including my own, I'm now realizing) - but the above path is the expectation.

Similarly (while not a Windows user), I believe the convention there would expect %localappdata%\sherlock\cache.sqlite3. This also prevents dirtying user home, or otherwise having to set NTFS file attrs to make the directory hidden (since dotfiles aren't automatically hidden as they are on *nix).

Again, open to thoughts/concerns/etc.

On that second point, the package platformdirs may be useful. Would require a poetry add platformdirs

from platform dirs import user_cache_dir and user_cache_dir(<appname>, <appauthor>)

I have not validated this functionality, just did a cursory search. Pretty sure I've used it in the past.

Using something like

from platformdirs import PlatformDirs dirs = PlatformDirs("sherlock", "sherlock_project", version=__version__) cache_dir = dirs.user_cache_dir

may be the most portable and alleviate most schema migration concerns, due to built in versioning by path

ppfeister · 2025-10-07T03:44:42Z

sherlock_project/cache.py

+                # Migration: Check if cache_duration column exists
+                cursor.execute("PRAGMA table_info(results)")
+                columns = [row[1] for row in cursor.fetchall()]
+
+                if 'cache_duration' not in columns:
+                    # Add cache_duration column to existing table
+                    cursor.execute('''
+                        ALTER TABLE results 
+                        ADD COLUMN cache_duration INTEGER NOT NULL DEFAULT 86400
+                    ''')


Could we offload migration into a separate function?

Additionally, it's probably best to use PRAGMA user_version (starting at 1) to track schema versions automatically, where an older schema version can be automatically migrated as appropriate (rather than arbitrarily checking for the existence of certain cols)

The function can take no arguments and doesn't need to return anything at this time, just migrate as needed, or allow an exception to be raised if there's a fatal error.

ppfeister · 2025-10-07T03:49:08Z

sherlock_project/sherlock.py

+        # Cache the result if enabled
+        if cache and result.status in [QueryStatus.CLAIMED, QueryStatus.AVAILABLE]:
+            cache.set(
+                username=username,
+                site=social_network,
+                status=result.status,
+                url=result.site_url_user if result.status == QueryStatus.CLAIMED else None
+            )


Would this running on each username check rather than at the end lead to possible time-of-use issues?

Username 1 loads db state A
Username 2 loads db state A
Username 1 writes db state A->Z
Username 2 writes db state A->X, over top of Z

End result of db in state X, losing the changes made by Username 1

(if this is an issue..)

This doesn't necessarily have to be a post-run function either, which may fail to cache results after an early termination, it can be a parallelism-safe side function as well. But maybe failing to cache due to early termination is desirable. I can see arguments for both sides. I might lean towards do-not-cache at early termination and keeping it easy as a post-run fx, since most early terminations are bad runs anyhow

ppfeister · 2025-10-07T04:15:55Z

Additional nice-to-have ----

If we could configure cache settings by environment variable, that would be extremely useful in terms of automations and containerized environments

i.e.
SHERLOCK_CACHE_DISABLE=True (setting --skip-cache or whatever we call it)
SHERLOCK_CACHE_PATH=/xyz/something.sqlite3 (accepting either a full path with db name or a directory)
SHERLOCK_CACHE_TTL=3600

(can't imagine a need for the others)

Easy enough for this PR or should we break that out into a separate RFE?

obiwan04kanobi requested review from ppfeister and sdushantha as code owners October 5, 2025 07:13

fix: remove unused json import from cache.py

2b1aaeb

ppfeister requested changes Oct 5, 2025

View reviewed changes

ppfeister self-assigned this Oct 5, 2025

ppfeister added the enhancement New feature or request label Oct 5, 2025

obiwan04kanobi and others added 2 commits October 6, 2025 19:11

Merge branch 'sherlock-project:master' into feature/2219-sqlite-caching

d3425f2

obiwan04kanobi requested a review from ppfeister October 6, 2025 15:28

Removing tox from pyproject.toml

e6b6f04

ppfeister added this to the v0.17.0 milestone Oct 7, 2025

ppfeister requested changes Oct 7, 2025

View reviewed changes

ppfeister added the hacktoberfest-accepted label Oct 7, 2025

Uh oh!

Add SQLite caching with --no-cache and --force-check flags (#2219) #2608

Are you sure you want to change the base?

Add SQLite caching with --no-cache and --force-check flags (#2219) #2608

Conversation

obiwan04kanobi commented Oct 5, 2025

Description

Changes

Core Caching Implementation

CLI Arguments Added

Cache Management Utility

Testing & Documentation

Usage Examples

Performance Impact

Implementation Details

Cache Strategy

Database Schema

Testing

Related Issue

Checklist

Code of Conduct

Uh oh!

obiwan04kanobi commented Oct 5, 2025

Uh oh!

ppfeister left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

obiwan04kanobi commented Oct 6, 2025

Uh oh!

ppfeister left a comment

Choose a reason for hiding this comment

Uh oh!

ppfeister Oct 7, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ppfeister Oct 7, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ppfeister Oct 7, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ppfeister Oct 7, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ppfeister Oct 7, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ppfeister Oct 7, 2025

Choose a reason for hiding this comment

Uh oh!

ppfeister commented Oct 7, 2025

Uh oh!

Uh oh!

ppfeister left a comment •

edited

Loading

ppfeister Oct 7, 2025 •

edited

Loading

ppfeister Oct 7, 2025 •

edited

Loading

ppfeister Oct 7, 2025 •

edited

Loading

ppfeister Oct 7, 2025 •

edited

Loading

ppfeister Oct 7, 2025 •

edited

Loading